-
Notifications
You must be signed in to change notification settings - Fork 829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multipart uploads with checksums on object locked buckets #6794
Conversation
So there are a couple of issues with how multi-part upload is implemented:
So this means that we need to keep the checksums in the |
5537c29
to
74f0c9c
Compare
I'm not sure about this, an AWS specific detail now leaks into the public trait for all stores... It also is not immediately obvious when the checksums are needed or not... I think this needs a bit more thought, I'll try to find some time in the next few weeks |
FWIW: I verified in the new tests we are now writing the checksums at the correct times for when the digest functionality is enabled. Now I am just trying to get the original tests working for when it isn't. |
The AWS docs are not very helpful here but we pieced together that to write to an S3 bucket using multi-part with object locking enabled you need to:
The current implementation does 2 but not 1 or 3. |
I'd like to re-iterate that this PR is functionally complete. Although I am unsure about the docs, I am sure it works. Now I'm just trying to fix the existing |
Sorry, by current implementation I meant what is on Was just trying to clarify for @tustvold based on his earlier comment |
Right, my comment pertained to the changes to the public API, we now have a somewhat peculiar quirk of AWS leaking out into the public API in a manner that also makes it very unclear when or if I need to specify the checksum or not. I need to find some time to sit down and think about how we could support this, ideally without making a breaking API change. |
So I think there is an less intrusive way to support this. UploadPart is not an issue as the per-part meta is never exposed. The issue solely concerns MultipartStore, in particular how to carry the checksum through https://docs.rs/object_store/latest/object_store/multipart/struct.PartId.html. Now we could add a checksum field to this, but this would be a breaking change. However, the content id is intentionally opaque. We could therefore just use the XML encoded https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompletedPart.html instead of an ETag when necessary. There would need to still be logic to handle an ETag based PartId, as these may have been stored in a database, but detecting this should be trivial e.g. by attempting to decode the XML. This would avoid needing to make any breaking changes, whilst also being future proof for any of the other checksum types |
fc51b10
to
6757413
Compare
@tustvold done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important we don't add implicit checksum enablement, not only because this could break workloads (e.g. those that aren't actually S3), but also because S3 recently added a load of new checksum algorithms
ba9e11c
to
faef21f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nits
d655774
to
0605e23
Compare
@tustvold some unrelated tests started failing. Do you know what's up? https://github.com/apache/arrow-rs/actions/runs/12221975779/job/34091689971?pr=6794 |
d655774
to
54fc547
Compare
54fc547
to
9839874
Compare
Which issue does this PR close?
Closes #6793.
Rationale for this change
Described in issue
What changes are included in this PR?
A failing test
Are there any user-facing changes?
Multipart uploads fail